Feat/bias evaluator WinoBias (Gender) #83
chaitanyamedidar wants to merge 5 commits into AOSSIE-Org:main
Conversation
Walkthrough
Added an evaluation framework: a PerplexityEvaluator and a WinoBiasEvaluator.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Evaluator as PerplexityEvaluator
    participant Dataset as HuggingFace Dataset
    participant Tokenizer
    participant Model
    User->>Evaluator: evaluate(model, tokenizer)
    Evaluator->>Dataset: load benchmark "test" split (stream)
    Dataset-->>Evaluator: rows
    loop per sample (up to n_samples)
        Evaluator->>Tokenizer: encode(text)
        Tokenizer-->>Evaluator: token_ids
        Evaluator->>Model: request logits (teacher-forced steps)
        Model-->>Evaluator: logits
        Evaluator->>Evaluator: compute log-softmax, NLL → sentence_ppl
    end
    Evaluator->>Evaluator: aggregate mean perplexity
    Evaluator-->>User: {perplexity: float}
```
```mermaid
sequenceDiagram
    actor User
    participant Evaluator as WinoBiasEvaluator
    participant Dataset as HuggingFace Dataset
    participant Tokenizer
    participant Perplexity as PerplexityEvaluator
    participant Model
    User->>Evaluator: evaluate(model, tokenizer)
    Evaluator->>Dataset: load_dataset("wino_bias", "type1_pro")
    Dataset-->>Evaluator: pro rows
    loop pro samples (up to n_samples)
        Evaluator->>Tokenizer: encode(text)
        Tokenizer-->>Evaluator: token_ids
        Evaluator->>Perplexity: compute_sentence_perplexity(model, token_ids)
        Perplexity->>Model: request logits
        Model-->>Perplexity: logits
        Perplexity-->>Evaluator: perplexity score
    end
    Evaluator->>Evaluator: compute stereotype_score (mean)
    Evaluator->>Dataset: load_dataset("wino_bias", "type1_anti")
    Dataset-->>Evaluator: anti rows
    loop anti samples (up to n_samples)
        Evaluator->>Tokenizer: encode(text)
        Tokenizer-->>Evaluator: token_ids
        Evaluator->>Perplexity: compute_sentence_perplexity(model, token_ids)
        Perplexity->>Model: request logits
        Model-->>Perplexity: logits
        Perplexity-->>Evaluator: perplexity score
    end
    Evaluator->>Evaluator: compute anti_stereotype_score (mean)
    Evaluator->>Evaluator: bias_score = abs(stereotype - anti_stereotype)
    Evaluator-->>User: {stereotype_score, anti_stereotype_score, bias_score}
```
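The flow above reduces to a simple aggregation once per-sentence perplexities are in hand. A minimal sketch of that aggregation (the helper name `wino_bias_score` and the plain-list inputs are illustrative, not the PR's actual API):

```python
def wino_bias_score(pro_ppls: list[float], anti_ppls: list[float]) -> dict:
    """Aggregate per-sentence perplexities into the three reported scores."""
    # Mean perplexity over each split; inf fallback mirrors the empty-split case.
    stereotype = sum(pro_ppls) / len(pro_ppls) if pro_ppls else float("inf")
    anti = sum(anti_ppls) / len(anti_ppls) if anti_ppls else float("inf")
    return {
        "stereotype_score": stereotype,
        "anti_stereotype_score": anti,
        # Absolute gap: 0 means the model finds pro- and anti-stereotypical
        # sentences equally (un)surprising.
        "bias_score": abs(stereotype - anti),
    }
```

A gap of zero indicates no measurable preference between the paired splits; larger gaps indicate stronger gender bias.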
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/eval/bias/wino_bias.py`:
- Around lines 83-86: bias_score is computed from the stereotype_score and anti_stereotype_score returned by _score_split; when both are infinite, the subtraction yields NaN. After computing the two scores, detect the both-infinite case (e.g., via math.isinf) and set bias_score to a stable sentinel such as float("inf") instead of subtracting; keep the existing abs(stereotype_score - anti_stereotype_score) behavior for finite values.
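A minimal sketch of the suggested guard (the function name and signature are illustrative; in the PR the logic would live inline in `evaluate()`):

```python
import math

def combine_bias_scores(stereotype_score: float, anti_stereotype_score: float) -> float:
    """Absolute difference of split scores, with a stable sentinel for inf - inf."""
    if math.isinf(stereotype_score) and math.isinf(anti_stereotype_score):
        # inf - inf would be NaN; report inf to signal that both splits failed.
        return float("inf")
    return abs(stereotype_score - anti_stereotype_score)
```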
In `@openverifiablellm/eval/perplexity.py`:
- Around lines 29-43: The stride parameter is stored in __init__ but never used. Update the sequence-processing logic (the method that prepares token windows or computes perplexity; look for methods like evaluate, score, or compute_perplexity) to respect self.stride when a text exceeds the model context length: use sliding windows whose start advances by self.stride rather than a single truncation, ensure tokenization/truncation operates on these windows, keep the docstring and self.stride assignment in sync, and add a test verifying that long sequences are segmented with the configured stride.
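One way to honor the stride, sketched with a standalone helper (`window_token_ids` is a hypothetical name; the PR would wire something like this into the evaluator):

```python
def window_token_ids(token_ids: list[int], max_length: int, stride: int) -> list[list[int]]:
    """Split a long token sequence into windows of up to max_length tokens.

    The window start advances by `stride` each step, so consecutive windows
    overlap by (max_length - stride) tokens.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    if len(token_ids) <= max_length:
        return [token_ids]  # short sequences need no windowing
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return windows
```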
- Around lines 141-149: The evaluation loop in openverifiablellm.eval.perplexity checks self.n_samples before filtering empty texts, so blank rows consume the quota. Inside the for-loop (where tokenizer.encode, compute_sentence_perplexity, and scores.append are used), skip empty texts first, then check whether the evaluated sample count (len(scores), or a dedicated counter) has reached self.n_samples and break. This ensures compute_sentence_perplexity is only called for non-empty rows and the n_samples limit counts evaluated samples.
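The suggested reordering can be sketched as follows (`collect_scores`, the `"text"` field, and the callback parameter are illustrative; the actual loop lives in the evaluator's `evaluate` method):

```python
def collect_scores(rows, n_samples, score_fn):
    """Score up to n_samples non-empty rows; blank rows never consume quota."""
    scores = []
    for row in rows:
        text = (row.get("text") or "").strip()
        if not text:
            continue  # skip empties before checking the sample budget
        if len(scores) >= n_samples:
            break  # the quota counts evaluated samples only
        scores.append(score_fn(text))
    return scores
```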
- Around lines 103-114: zip(logits_batch, targets) silently truncates when the model output length is wrong. Before the loop computing nll_sum, validate that len(logits_batch) == len(targets) and raise a clear exception (e.g., ValueError) reporting both lengths if they differ; optionally also validate that each logits row covers max(target) or matches the expected vocab_size before computing log-probs. Only then proceed to the existing loop and return math.exp(nll_sum / len(targets)).
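A dependency-free sketch of the validated scoring path (plain lists of floats stand in for tensors, and the function name is illustrative):

```python
import math

def sentence_perplexity(logits_batch: list[list[float]], targets: list[int]) -> float:
    """Per-token NLL via a numerically stable log-softmax, with explicit
    length checks so a wrong-length model output raises instead of being
    silently truncated by zip."""
    if len(logits_batch) != len(targets):
        raise ValueError(
            f"logits/targets length mismatch: {len(logits_batch)} vs {len(targets)}"
        )
    nll_sum = 0.0
    for logits, target in zip(logits_batch, targets):
        if not 0 <= target < len(logits):
            raise ValueError(f"target {target} out of range for vocab {len(logits)}")
        m = max(logits)  # max-shift for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        nll_sum += log_z - logits[target]  # -log p(target)
    return math.exp(nll_sum / len(targets))
```

With uniform two-way logits, every step costs ln 2 nats, so the perplexity is exactly 2.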
In `@pyproject.toml`:
- Around lines 14-19: The dependencies list in pyproject.toml is missing core LLM packages. Add "numpy" and "torch" to the existing dependencies array (alongside "datasets", "defusedxml", "sentencepiece", "tokenizers==0.15.2") so they are installed as required dependencies, pinning versions if necessary for compatibility.
In `@tests/test_eval.py`:
- Around lines 48-52: The test helper _load currently returns anti_data for any unexpected input, which can mask integration bugs. Make the mock _load strictly validate its inputs (check the incoming name, config, and split against the expected values) and raise an explicit exception (e.g., ValueError or AssertionError) when an unexpected name/config/split is passed instead of defaulting to anti_data; adjust the pro_data/anti_data references accordingly and update any tests (e.g., callers of evaluate) that relied on the fallback behavior.
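A strict mock along these lines (`make_strict_load` and its signature are illustrative stand-ins for the test's `_load` patch of HuggingFace `load_dataset`):

```python
def make_strict_load(pro_data, anti_data):
    """Build a load_dataset stand-in that fails loudly on unexpected args."""
    def _load(name, config=None, split=None, **kwargs):
        if name != "wino_bias":
            raise ValueError(f"unexpected dataset name: {name!r}")
        if config == "type1_pro":
            return pro_data
        if config == "type1_anti":
            return anti_data
        # No silent fallback: an unexpected config is a test failure.
        raise ValueError(f"unexpected config: {config!r}")
    return _load
```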
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9bab766c-34c7-479f-b7cb-b51e0c46b3bd
📒 Files selected for processing (7)
- openverifiablellm/eval/__init__.py
- openverifiablellm/eval/base.py
- openverifiablellm/eval/bias/__init__.py
- openverifiablellm/eval/bias/wino_bias.py
- openverifiablellm/eval/perplexity.py
- pyproject.toml
- tests/test_eval.py
```toml
dependencies = [
    "datasets",
    "defusedxml",
    "sentencepiece",
    "tokenizers==0.15.2"
]
```
🧩 Analysis chain

🏁 Script executed:

```shell
find . -name "pyproject.toml" -type f
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 88

🏁 Script executed:

```shell
cat -n ./pyproject.toml
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 1676
Add numpy and torch to required dependencies.
These are core LLM dependencies for this project and should be in the required dependency list, not optional.
Suggested patch

```diff
 dependencies = [
     "datasets",
+    "numpy",
+    "torch",
     "defusedxml",
     "sentencepiece",
     "tokenizers==0.15.2"
 ]
```
Actionable comments posted: 1
♻️ Duplicate comments (1)

openverifiablellm/eval/bias/wino_bias.py (1)

83-83: 🧹 Nitpick | 🔵 Trivial: Move `import math` to module level.

The `math` import is placed mid-function, after the nested `_score_split` definition. Standard practice is to place imports at module level for clarity; Python caches imports so there is no repeated runtime cost, but the style is unconventional.

♻️ Suggested fix: move the import to the top of the file:

```diff
 from typing import Optional
+import math
 from ..base import BaseEvaluator
```

Then remove the in-function import at line 83:

```diff
     return float(sum(scores) / len(scores)) if scores else float("inf")
-    import math
     stereotype_score = _score_split(pro_ds)
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/eval/bias/wino_bias.py`:
- Around lines 78-81: The averaging over scores collected via PerplexityEvaluator.compute_sentence_perplexity(model, token_ids) can produce float("inf") for the whole split if any single sentence returns infinity. Filter out non-finite values (math.isfinite) before computing the mean, and return float("inf") (the original fallback) only when no finite scores remain, so one infinite sentence does not make the entire split score infinite.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 97aac695-5da9-424b-8626-283d689db63b
📒 Files selected for processing (1)
openverifiablellm/eval/bias/wino_bias.py
```python
        scores.append(
            PerplexityEvaluator.compute_sentence_perplexity(model, token_ids)
        )
    return float(sum(scores) / len(scores)) if scores else float("inf")
```
Consider filtering inf values before computing the mean.
If compute_sentence_perplexity returns float("inf") for any sentence (e.g., sequences with < 2 tokens), the entire split score becomes inf since sum([..., inf, ...]) is inf. While WinoBias sentences are typically well-formed, malformed or edge-case entries could skew the entire evaluation.
🛡️ Suggested defensive approach

```diff
-    return float(sum(scores) / len(scores)) if scores else float("inf")
+    finite_scores = [s for s in scores if math.isfinite(s)]
+    return float(sum(finite_scores) / len(finite_scores)) if finite_scores else float("inf")
```

This filters out infinite values, computing the mean only over valid perplexity scores.
Addressed Issues:
Implement WinoBias Gender Bias Evaluator: this PR directly implements the bias-testing evaluation metric listed as a success criterion in the project specification.
Screenshots/Recordings:
Additional Notes:
The project motivation explicitly states that LLM providers have growing incentive to bias models in favour of their sponsors and advertisers.
Without a concrete measurement tool, the claim of an impartial, unbiased model cannot be verified. This PR implements the first piece of that measurement suite.
Structure:
Why WinoBias:
Gender bias is one of the most well-documented forms of systematic skew that emerges from biased training data, which is exactly the problem this project addresses. WinoBias is publicly available on HuggingFace, consistent with the project's open-data philosophy, and provides clean paired sentence comparisons, making bias measurement interpretable.
Why this structure (eval/bias/ subpackage):
Each bias benchmark is its own independent class in its own file, so future benchmarks such as TruthfulQA (factual bias) or PoliEval (political bias) can be added and reviewed as separate PRs without merge conflicts, and any benchmark can be merged independently.
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
- New Features
- Dependencies
- Tests